MaxViT: Multi-Axis Vision Transformer
Transformers have recently gained significant attention in the computer
vision community. However, the lack of scalability of self-attention mechanisms
with respect to image size has limited their wide adoption in state-of-the-art
vision backbones. In this paper we introduce an efficient and scalable
attention model we call multi-axis attention, which consists of two aspects:
blocked local and dilated global attention. These design choices allow
global-local spatial interactions on arbitrary input resolutions with only
linear complexity. We also present a new architectural element by effectively
blending our proposed attention model with convolutions, and accordingly
propose a simple hierarchical vision backbone, dubbed MaxViT, by simply
repeating the basic building block over multiple stages. Notably, MaxViT is
able to "see" globally throughout the entire network, even in earlier,
high-resolution stages. We demonstrate the effectiveness of our model on a
broad spectrum of vision tasks. On image classification, MaxViT achieves
state-of-the-art performance under various settings: without extra data, MaxViT
attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our
model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a
backbone delivers favorable performance on object detection as well as visual
aesthetic assessment. We also show that our proposed model expresses strong
generative modeling capability on ImageNet, demonstrating the superior
potential of MaxViT blocks as a universal vision module. We will make the code
and models publicly available.
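
As a concrete illustration of the two attention axes described above, the short PyTorch sketch below shows the block (local window) and grid (dilated global) partitions that multi-axis attention alternates between. The function names, the window/grid size of 4, and the use of a stock MultiheadAttention layer are illustrative assumptions rather than the released MaxViT implementation.

    import torch

    def block_partition(x, p):
        # x: (B, H, W, C) -> (num_windows, p*p, C); attention inside each
        # p x p window gives the blocked *local* interaction.
        b, h, w, c = x.shape
        x = x.view(b, h // p, p, w // p, p, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, c)

    def grid_partition(x, g):
        # x: (B, H, W, C) -> (num_groups, g*g, C); each group holds g*g
        # tokens sampled on a strided g x g grid spanning the whole map,
        # giving the dilated *global* interaction.
        b, h, w, c = x.shape
        x = x.view(b, g, h // g, g, w // g, c)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, c)

    # Toy usage: both partitions keep the attention cost linear in H * W.
    attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
    x = torch.randn(2, 16, 16, 32)            # (B, H, W, C) feature map
    local_tokens = block_partition(x, p=4)    # 4 x 4 neighbourhood windows
    global_tokens = grid_partition(x, g=4)    # 4 x 4 strided global grid
    y_local, _ = attn(local_tokens, local_tokens, local_tokens)
    y_global, _ = attn(global_tokens, global_tokens, global_tokens)
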
Quality prediction and visual enhancement of user-generated content
With the rapid development of streaming media technologies and the explosion of user-generated content (UGC) captured and streamed over social media platforms such as YouTube and Facebook, videos now play a central role in the daily lives of billions of people. The growing popularity of UGC videos has created a great need to understand and analyze these billions of shared contents in order to optimize pipelines for efficient UGC video storage, processing, and streaming. UGC videos, typically created by amateur videographers, often suffer from unsatisfactory perceptual quality arising from any stage of the video acquisition and production process. Predicting UGC video quality is therefore much more challenging than assessing the quality of the synthetically distorted videos found in traditional video quality databases. In this dissertation, we comprehensively investigate the quality prediction and enhancement problems for UGC pictures and videos.

We first study a particular artifact, the "banding artifact," a common video compression impairment. We begin by analyzing the perceptual and encoding aspects of color bands, and then build a new distortion-specific no-reference quality metric dedicated to banding visibility. We further develop a banding artifact removal algorithm by formulating it as a visual enhancement problem, which we solve with a content-adaptive smoothing filter followed by dithered quantization, applied as a post-processing module. We also extend this debanding filter by learning a cascaded artifact removal network that jointly removes banding and blocking artifacts, yielding greater visual enhancement.

UGC distortions are diverse, complicated, and commingled, so no single quality factor suffices to predict overall quality, and blindly predicting the perceptual quality of UGC videos is very challenging. We first conduct a benchmark study of leading no-reference video quality metrics on recent large-scale UGC video databases, and then leverage feature selection to build a new compact video quality model, which we dub VIDEVAL, on top of a curated set of effective spatial and temporal features from popular VQA models. In addition to this compact model, we build an efficiency-oriented model for practical use, called RAPIQUE, by combining efficient natural scene statistics features with pre-trained deep learning models; RAPIQUE employs an aggressive spatial and temporal sampling process to boost its efficiency. Along the way, we also explore the temporal statistics of natural videos, which helps push forward the performance of VQA models on motion-intensive videos with large camera motion.

Next, we study visual restoration and enhancement of pictures degraded by distortions commonly found in UGC videos, including noise, blur, and low light. Building on recent progress in Transformer and multi-layer perceptron (MLP) models, we propose an efficient MLP-based vision backbone, which we dub MAXIM, that can effectively restore images suffering from degradation. The core component of MAXIM is the multi-axis gated MLP block, which achieves local and global spatial interactions with linear complexity. We further extend this idea to high-level vision tasks such as image recognition by proposing another vision backbone called MaxViT. Our extensive numerical and visual experiments show that this multi-axis approach provides a strong vision component for both high-level and low-level vision tasks. Finally, we conclude the thesis with remarks on current challenges and future directions for UGC video quality prediction and enhancement.
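
The debanding step described above (a content-adaptive smoothing filter followed by dithered quantization) can be sketched in a few lines. The code below is an illustrative stand-in, not the dissertation's actual filter: the gradient-based flatness test, the thresholds, and the uniform dither are all assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def deband(luma_u8, sigma=2.0, flat_thresh=2.0, dither_amp=0.5):
        # luma_u8: 2-D uint8 luma plane of a decoded frame.
        img = luma_u8.astype(np.float32)
        smooth = gaussian_filter(img, sigma=sigma)
        # Content adaptivity (simplified): only replace pixels in low-gradient
        # (flat) regions, where banding is visible; keep edges and textures.
        gy, gx = np.gradient(img)
        out = np.where(np.hypot(gx, gy) < flat_thresh, smooth, img)
        # Dithered re-quantization to 8 bits breaks up residual contours.
        out = out + np.random.uniform(-dither_amp, dither_amp, out.shape)
        return np.clip(np.round(out), 0, 255).astype(np.uint8)
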
RAPIQUE: Rapid and Accurate Video Quality Prediction of User Generated Content
Blind or no-reference video quality assessment of user-generated content
(UGC) has become a trending, challenging, heretofore unsolved problem. Accurate
and efficient video quality predictors suitable for this content are thus in
great demand to achieve more intelligent analysis and processing of UGC videos.
Previous studies have shown that natural scene statistics and deep learning
features are both sufficient to capture spatial distortions, which contribute
to a significant aspect of UGC video quality issues. However, these models are
either incapable or inefficient for predicting the quality of complex and
diverse UGC videos in practical applications. Here we introduce an effective
and efficient video quality model for UGC content, which we dub the Rapid and
Accurate Video Quality Evaluator (RAPIQUE), which we show performs comparably
to state-of-the-art (SOTA) models but with orders-of-magnitude faster runtime.
RAPIQUE combines and leverages the advantages of both quality-aware scene
statistics features and semantics-aware deep convolutional features, allowing
us to design the first general and efficient spatial and temporal (space-time)
bandpass statistics model for video quality modeling. Our experimental results
on recent large-scale UGC video quality databases show that RAPIQUE delivers
top performances on all the datasets at a considerably lower computational
expense. We hope this work promotes and inspires further efforts towards
practical modeling of video quality problems for potential real-time and
low-latency applications. To promote public usage, an implementation of RAPIQUE
has been made freely available online: https://github.com/vztu/RAPIQUE.
Comment: IEEE Open Journal of Signal Processing, 2021
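
As a rough sketch of the recipe described above, the code below combines a few bandpass (MSCN) scene-statistics summaries with features from a frozen pre-trained ResNet-50 over sparsely sampled frames, then fits a regressor on quality labels. The exact features, sampling rate, and SVR regressor here are assumptions for illustration; the released implementation at the URL above is the reference.

    import numpy as np
    import torch
    from scipy.ndimage import gaussian_filter
    from sklearn.svm import SVR
    from torchvision.models import resnet50

    def mscn_stats(gray):
        # Mean-subtracted contrast-normalized (MSCN) coefficients and simple
        # first/second-order summaries: a classic bandpass NSS quality cue.
        mu = gaussian_filter(gray, 7.0 / 6.0)
        sigma = np.sqrt(np.abs(gaussian_filter(gray * gray, 7.0 / 6.0) - mu * mu))
        mscn = (gray - mu) / (sigma + 1.0)
        return np.array([mscn.mean(), mscn.std(), sigma.mean(), sigma.std()])

    cnn = resnet50(weights="DEFAULT").eval()
    backbone = torch.nn.Sequential(*list(cnn.children())[:-1])   # drop classifier

    def video_features(frames_rgb):
        # frames_rgb: list of H x W x 3 uint8 frames from a decoded video.
        feats = []
        for f in frames_rgb[::8]:                  # aggressive temporal sampling
            gray = f.mean(axis=2).astype(np.float32)
            x = torch.from_numpy(f).permute(2, 0, 1).float()[None] / 255.0
            with torch.no_grad():                  # ImageNet normalization omitted
                deep = backbone(x).flatten().numpy()   # 2048-d semantic features
            feats.append(np.concatenate([mscn_stats(gray), deep]))
        return np.mean(feats, axis=0)              # average-pool over sampled frames

    # Training: fit a regressor on features of MOS-labelled UGC videos, e.g.
    # reg = SVR().fit(np.stack([video_features(v) for v in train_videos]), train_mos)
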
CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial
sensing for autonomous driving. Although the recent literature has made
significant progress on BEV map understanding, existing methods are all based
on single-agent camera systems, which struggle to handle occlusions and to
detect distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V)
communication technologies have enabled autonomous vehicles to share sensing
information, which can dramatically improve the perception performance and
range as compared to single-agent systems. In this paper, we propose CoBEVT,
the first generic multi-agent multi-camera perception framework that can
cooperatively generate BEV map predictions. To efficiently fuse camera features
from multi-view and multi-agent data in an underlying Transformer architecture,
we design a fused axial attention (FAX) module, which captures sparse local
and global spatial interactions across views and agents. Extensive
experiments on the V2V perception dataset, OPV2V, demonstrate that CoBEVT
achieves state-of-the-art performance for cooperative BEV semantic
segmentation. Moreover, CoBEVT is shown to be generalizable to other tasks,
including 1) BEV segmentation with single-agent multi-camera and 2) 3D object
detection with multi-agent LiDAR systems, and achieves state-of-the-art
performance with real-time inference speed.
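
To illustrate the fused axial attention idea, the sketch below lets tokens from all agents and views attend jointly within each local spatial window, so information is exchanged across both space and agents; a strided global grid branch would be built the same way. The class name FAXSketch, the shapes, and the single-branch simplification are assumptions, not the paper's FAX module.

    import torch

    class FAXSketch(torch.nn.Module):
        def __init__(self, dim=64, heads=4, window=4):
            super().__init__()
            self.window = window
            self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            # x: (B, N_agents, H, W, C) BEV features gathered from every agent.
            b, n, h, w, c = x.shape
            p = self.window
            # Local branch: within each p x p spatial window, tokens from all
            # agents/views attend jointly, fusing information across agents.
            t = x.view(b, n, h // p, p, w // p, p, c)
            t = t.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, n * p * p, c)
            t, _ = self.attn(t, t, t)
            t = t.reshape(b, h // p, w // p, n, p, p, c)
            t = t.permute(0, 3, 1, 4, 2, 5, 6).reshape(b, n, h, w, c)
            return t   # a strided global (grid) branch would follow analogously

    x = torch.randn(2, 3, 16, 16, 64)    # 2 scenes, 3 cooperating agents
    print(FAXSketch()(x).shape)          # torch.Size([2, 3, 16, 16, 64])
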
ROMNet: Renovate the Old Memories
Renovating the memories in old photos is an intriguing research topic in
computer vision fields. These legacy images often suffer from severe and
commingled degradations such as cracks, noise, and color fading, while the lack of
large-scale paired old photo datasets makes this restoration task very
challenging. In this work, we present a novel reference-based end-to-end
learning framework that can jointly repair and colorize the degraded legacy
pictures. Specifically, the proposed framework consists of three modules: a
restoration sub-network for degradation restoration, a similarity sub-network
for color histogram matching and transfer, and a colorization subnet that
learns to predict the chroma elements of the images conditioned on chromatic
reference signals. The whole system takes advantage of the color histogram
priors in a given reference image, which vastly reduces the dependency on
large-scale training data. Apart from the proposed method, we also create, to
our knowledge, the first public and real-world old photo dataset with paired
ground truth for evaluating old photo restoration models, wherein each old
photo is paired with a pristine image manually restored by Photoshop experts.
Our extensive experiments conducted on both synthetic and real-world datasets
demonstrate that our method significantly outperforms the state of the art both
quantitatively and qualitatively.
Comment: Paper major revision
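
To illustrate the reference color-histogram prior described above, the sketch below performs a classical, non-learned histogram matching of the chroma channels from a reference photo onto an old photo in Lab space. This is only a stand-in for the learned similarity and colorization sub-networks; the function name and the use of scikit-image's match_histograms are assumptions.

    import numpy as np
    from skimage import color
    from skimage.exposure import match_histograms

    def transfer_reference_color(old_rgb, ref_rgb):
        # Both inputs: float RGB arrays in [0, 1]. Work in Lab space: keep the
        # old photo's luminance (structure), borrow the chroma distribution
        # from the reference via per-channel histogram matching.
        old_lab = color.rgb2lab(old_rgb)
        ref_lab = color.rgb2lab(ref_rgb)
        matched_ab = match_histograms(old_lab[..., 1:], ref_lab[..., 1:],
                                      channel_axis=-1)
        out_lab = np.concatenate([old_lab[..., :1], matched_ab], axis=-1)
        return np.clip(color.lab2rgb(out_lab), 0.0, 1.0)

    # Usage: colored = transfer_reference_color(restored_old_photo, reference_photo)
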